Actuarial Data Science - Open Learning Resource
In this lecture, we focus on framing problems: turning a vague business concern into a clear, answerable data science question. A well-posed question saves you time later, as it guides what data are needed and which methods are appropriate.
Understand and explain different types of questions
Justify the importance of a strong understanding of the business, including its objectives, constraints, and operating environment, when designing and implementing a data analytics project
Data analysis is a highly iterative and non-linear process.
The data analysis process can be viewed as a specific application of the Actuarial Control Cycle. We refer to this application as the Data Science Lifecycle (DSL).
Here, we zoom in on the first step of the DSL: problem definition. You should start thinking about where your project sits in this lifecycle, and how a clear problem statement will affect all downstream steps.
Data Science Lifecycle
Refining a question follows an epicycle of three iterative steps. Use what you know about the different question types and the characteristics of a good question as a guide, and iterate through the following steps:
Establishing your expectations about the question
Gathering information about the question
Determining whether your expectations match the information you gathered, and refining your question (or expectations) if they do not
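The three steps above can be sketched as a small check. This is only a minimal illustration, not a prescribed procedure: the function name check_expectation, the tolerance, and all of the figures are hypothetical, and in practice "gathering information" may mean a literature search or a conversation with the business rather than a calculation.

```python
from statistics import mean


def check_expectation(expected, observations, tolerance):
    """Compare a prior expectation with gathered information.

    Returns the observed average and whether it lies within
    `tolerance` of the expectation.
    """
    observed = mean(observations)
    matches = abs(observed - expected) <= tolerance
    return observed, matches


# Step 1: establish an expectation (hypothetical claim frequency).
expected_frequency = 0.10

# Step 2: gather information (made-up claims per policy-year by segment).
observed_frequencies = [0.16, 0.19, 0.17, 0.20, 0.18]

# Step 3: compare, then decide whether the question or the expectation needs refining.
observed, matches = check_expectation(
    expected_frequency, observed_frequencies, tolerance=0.03
)
if matches:
    decision = "expectation holds; keep the question as posed"
else:
    decision = "mismatch; refine the question or the expectation"

print(f"observed={observed:.2f}, decision: {decision}")
```

A mismatch is not a failure. It is the signal to go back to step 1 with a sharper question or a revised expectation.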
Examples:
Source: adapted from Leek and Peng (2015)
“We have found that the most frequent failure in data analysis is mistaking the type of question being considered.” — Leek and Peng (2015)
Data analysis flowchart for question types (Source: Leek and Peng (2015))
Example:
Specify the question type your team identified in the previous activity
Share the refined question and its type by replying to your previous post in the Teams channel
Challenge your question:
What makes a good question?
Adapted from Peng and Matsui (2015); see Chapter 3.3 of The Art of Data Science for details.
Adapted from Peng and Matsui (2015); see Chapter 3.4 of The Art of Data Science for details.
Assume you are an actuary working for a consulting firm. Two teams (Team A and Team B) are working on different projects.
Team A works on forecasting demand for a grocery retailer. This project aims to estimate demand so that the supply chain knows how much product to send to stores. Once demand is predicted, the supply chain can determine how much to order, when to order it, and where to send it. Over-forecasting leads to waste, while under-forecasting leads to lost sales. Prediction accuracy is the most important consideration for this project.
Team B works on predicting claims for a motor insurer, specifically for its comprehensive insurance product. This project aims to predict both the number and the cost of claims for the next year in order to set appropriate premiums for prospective policyholders. Over-estimating claims will result in non-competitive premiums, while under-estimating claims could lead to losses. Interpretability and ease of implementation of the model are more important than prediction accuracy for this project. However, the model should still be reasonably predictive.
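Team B's two targets, the number and the cost of claims, are commonly summarised as claim frequency and claim severity. The sketch below shows how the two combine into an expected claim cost per policy-year. It is an illustration under stated assumptions, not Team B's actual method: the function name, its inputs, and the portfolio figures are all hypothetical. At portfolio level the product is simply total cost divided by exposure; once frequency and severity are modelled separately with rating factors, multiplying the two predictions additionally assumes they are independent.

```python
def claim_cost_per_policy_year(claim_count, exposure_years, total_claim_cost):
    """Split historical claims experience into frequency and severity.

    Returns (frequency, severity, expected claim cost per policy-year).
    """
    if claim_count <= 0 or exposure_years <= 0:
        raise ValueError("claim_count and exposure_years must be positive")
    frequency = claim_count / exposure_years   # claims per policy-year
    severity = total_claim_cost / claim_count  # average cost per claim
    return frequency, severity, frequency * severity


# Made-up portfolio figures, for illustration only.
frequency, severity, cost = claim_cost_per_policy_year(
    claim_count=120, exposure_years=1000, total_claim_cost=360_000
)
print(
    f"frequency={frequency:.3f}, severity={severity:.0f}, "
    f"cost per policy-year={cost:.2f}"
)
```

Keeping the two components separate is one way to serve the interpretability requirement: a change in premium can be traced to more claims, costlier claims, or both.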
Question:
